1.  SUMMARY

In  this  Quarterly  Progress  Report,  we  present  our  work 
performed  during  the  period  September  8,  1979  to  December  7,  1979. 

1.1  Introduction 

Our  research  during  this  past  quarter  was  divided  among  the 
areas  of  natural  phonetic  synthesis,  phonetic  recognition,  and  a 
multirate  speech  compression  system.  The  phonetic  recognition  and 
synthesis  programs  will  operate  together  as  a  very  low  rate 
phonetic  vocoder.  Each  of  the  following  subsections  (1.2,  1.3, 
1.4)  refers  to  one  section  of  the  remainder  of  the  QPR  (Sections  2, 
3,  and  4).

1.2  Synthesis 

During  this  quarter  we  achieved  a  major  milestone  in  the 
phonetic  synthesis  project  by  completing  the  transcription  of  the 
initial  set  of  diphone  utterances.  This  means  that  we  are  now 
capable  of  synthesizing  any  arbitrary  English  sentence.  There 
will,  however,  be  some  additional  labelling  effort  associated  with 
the  exhaustive  testing  of  the  diphone  data  base. 

We  have  also  added  new  testing  modules  to  the  synthesis 
program  and  have  been  testing  diphones  both  in  isolation  and  in 
complete  sentences.  Only  relatively  minor  changes  and  corrections
have  been  made  this  quarter  to  the  synthesis  program  itself.  We
feel  that  the  program  design  is  now  stable. 

The  labelled  data  base  is  described  in  Section  2.1.  The 
progress  made  on  testing  is  discussed  in  Section  2.2,  and  the  few 
program  changes  are  listed  in  Section  2.3. 

1.3  Phonetic  Recognition 

Preliminary  versions  of  the  diphone  network  compiler  and 
diphone  matcher  were  implemented  during  this  past  quarter.  The 
design  of  the  diphone  recognition  system  also  underwent 
considerable  change.  These  changes,  and  in  particular,  the  scoring 
philosophy  embodied  in  the  programs,  are  discussed  in  Section  3.1. 
The  status  of  the  program  implementation  is  reported  in  Section 
3.2.  In  Section  3.3  we  outline  the  planned  tasks  in  recognition 
for  the  remaining  six  months  of  the  current  contract  year.  Though 
there  will  be  some  amount  of  research  into  better  program  design 
and  scoring  metrics,  the  bulk  of  the  remaining  effort  will  be 
expended  in  training  the  recognition  system  on  large  amounts  of 
speech  to  improve  its  performance. 

1.4  Multirate  Coding 

At  the  beginning  of  this  quarter,  the  quality  of 
transform-coded  speech  was  somewhat  rough.  For  the  full-band  ATC
system  operating  at  16  kb/s,  the  roughness  is  due  to  quantization
noise,  while  for  the  multirate  system  operating  at  9.6  kb/s  or 
below  the  roughness  is  mainly  due  to  the  regeneration  of  missing 
high-frequency  components.  In  this  report,  we  discuss  the 
full-band  16  kb/s  case.  We  were  able  to  achieve  a  substantial 
improvement  in  coder  performance  by  an  improved  bit-allocation 
scheme  and  by  optimum  quantization.  These  matters  are  discussed  in 
Section  4. 

As  for  the  multirate  case,  its  output  speech  quality  is  mostly 
governed  by  the  HFR  (high-frequency  regeneration)  technique  used. 
For  that  reason,  we  are  currently  working  on  improved  HFR  methods. 
We  defer  any  discussions  on  the  multirate  system  until  the  next 
quarterly  progress  report.

2.  SYNTHESIS

During  this  quarter  we  achieved  a  major  milestone  in  the 
phonetic  synthesis  project  by  completing  the  transcription  of  the 
initial  set  of  diphone  utterances.  This  means  that  we  are  now 
capable  of  synthesizing  any  arbitrary  English  sentence.  There 
will,  however,  be  some  additional  labelling  effort  associated  with 
the  exhaustive  testing  of  the  diphone  data  base. 

We  have  also  added  new  testing  modules  to  the  synthesis 
program  and  have  been  testing  diphones  both  in  isolation  and  in 
complete  sentences.  Only  relatively  minor  changes  and  corrections 
have  been  made  this  quarter  to  the  synthesis  program  itself.  We 
feel  that  the  program  design  is  now  stable. 

The  labelled  data  base  is  described  in  Section  2.1.  The 
progress  made  on  testing  is  discussed  in  Section  2.2,  and  the  few 
program  changes  are  listed  in  Section  2.3. 

2.1  Diphone  Data  Base 

As  of  the  end  of  this  quarter,  we  have  completed  the  initial 
transcription  (labelling)  of  the  diphone  data  base.  There  are 
currently  2652  diphones  in  the  data  base.  The  ongoing  testing 
(Section  2.2)  presently  indicates  that  about  10%  of  these  diphone 
labels  will  require  modification.  As  we  test  the  synthesizer  with
complete  sentences,  we  may  find  that  we  need  a  few  additional
diphones.  These  will  primarily  be  diphones  in  a  particular 
phonetic  context  that  could  not  be  taken  from  the  unconstrained 
context. 

In  Appendix  A  we  list  by  category  all  the  different  diphones 
currently  in  the  data  base.  Appendix  B  contains  an  alphabetized 
list  of  the  2652  diphones. 

2.2  Diphone  Testing 

Now  that  the  basic  inventory  of  diphones  is  complete,  we  can 
synthesize  any  arbitrary  English  phoneme  sequence.  All  that 
remains  in  this  phase  of  the  synthesis  project  is  to  test  all  the 
diphones  exhaustively  using  the  automatic  test  sequence  generators, 
and  also  to  synthesize  a  number  of  full  sentences  to  verify  that 
the  speech  sounds  natural. 

Last  quarter  we  described  a  feature  in  the  phonetic 
synthesizer  that  automatically  generated  CVC  sequences  to 
facilitate  testing  of  the  CV  and  VC  diphones.  This  past  quarter, 
we  added  a  sequence  generator  to  test  CC  diphones.  This  new 
sequence  generator  synthesizes  nonsense  strings  of  the  form: 

-  C1  V  C2  C1  V  C2  -

To  illustrate,  consider  the  testing  of  the  diphone  [N  S].  The
testing  program  would  synthesize  the  following  nonsense  sequence: 

-  S  EH  N  S  EH  N  - 

The  [N  S]  diphone  is  used  in  the  middle  of  this  sequence.  We  will 
shortly  be  implementing  another  option  that  will  generate  sequences 
to  test  VV  diphones  in  a  similar  manner.
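The CC test pattern can be illustrated with a short Python sketch. The function name and the default carrier vowel (EH, taken from the [N S] example) are assumptions made for illustration; this is not the test module of the synthesis program.

```python
def cc_test_sequence(left, right, vowel="EH"):
    """Build the nonsense string used to test the CC diphone [left right].

    The pattern is  - C1 V C2 C1 V C2 -  with C1 = right and C2 = left,
    so that the diphone [C2 C1] falls in the middle of the sequence.
    """
    c1, c2 = right, left
    return ["-", c1, vowel, c2, c1, vowel, c2, "-"]


print(" ".join(cc_test_sequence("N", "S")))  # - S EH N S EH N -
```

The diphone under test occupies positions 3 and 4 of the returned list, flanked on both sides by the same CVC frame.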

At  present,  we  have  completed  testing  of  approximately  10%  of 
the  diphones.  The  procedure  began  quite  slowly,  because  we 
encountered  several  diphones  that  did  not  sound  natural.  However, 
as  we  corrected  these,  we  also  were  able  to  generalize  the  changes 
to  many  other  related  diphones,  so  that  the  testing  is  now 
progressing  more  rapidly.  Most  of  the  remaining  errors  are 
expected  to  be  due  to  accidental  misplacement  of  phoneme  boundaries 
in  the  short  nonsense  utterances  from  which  the  diphones  were 
extracted. 

In  addition  to  testing  the  diphones  exhaustively  in  isolation, 
we  shall  be  testing  them  by  synthesizing  a  large  number  of  full 
sentences.  Since  completing  the  diphone  inventory  we  have 
synthesized  seven  complete  sentences.  These  tests  pointed  out  a 
small  number  of  labeling  and  program  errors  which  were  then 
corrected.  The  quality  of  these  sentences  is  markedly  improved 
over  that  which  we  had  obtained  previously.  We  attribute  this
improvement  to  the  small  but  significant  program  changes  noted
below,  and  the  fact  that  we  now  have  all  of  the  diphones  we  need  to 
synthesize  unconstrained  material. 

2.3  Synthesis  Program  Changes 

Changes  to  the  synthesis  program  this  quarter  have  consisted 
of  the  addition  of  the  automated  testing  modules,  and  the 
correction  of  some  program  bugs  in  the  diphone  concatenation 
subroutines.  The  automated  testing  modules  have  been  described 
above,  as  well  as  in  previous  reports.  The  few  program  bugs  which 
we  have  corrected  appear  to  be  the  last  of  the  logical  errors  of 
this  type.  However,  as  we  test  large  amounts  of  data  we  may  arrive 
at  modifications  to  the  reported  procedures  that  result  in  more 
natural  sounding  synthesis.

3.  PHONETIC  RECOGNITION

A  good  deal  of  the  time  spent  last  quarter  on  diphone  template 
recognition  was  used  to  design  the  system.  We  continue  to  present 
this  design  in  Section  3.1  by  describing  the  scoring  philosophy  in 
considerable  detail.  During  this  past  quarter,  we  implemented  the 
bulk  of  the  programs  necessary  for  the  basic  design  as  presented. 
This  implementation  work  is  described  in  Section  3.2.  The  kind  of 
work  that  we  anticipate  during  the  remainder  of  the  project  is 
spelled  out  in  Section  3.3. 

3.1  Scoring  Philosophy 

The  scoring  philosophy  plays  a  particularly  important  part  in 
the  operation  of  our  diphone  template  recognition  system.  Simply 
stated,  the  goal  of  the  recognition  system  is  to  pick  the  most 
probable  phoneme  sequence  given  the  sequence  of  input  speech 
spectra.  The  scoring  philosophy  determines  how  such  probabilities 
are  to  be  accurately  calculated. 

In  this  section  we  will  systematically  derive  our  scoring 
philosophy  and  indicate  how  far  the  current  implementation  goes 
towards  its  realization.  In  this  derivation  the  scoring  philosophy 
is  spelled  out  in  an  incremental  sense.  In  order  to  do  this  we 
start  with  a  simple  expression,  which  represents  the  probability  of 
a  particular  path  given  the  input,  and  systematically  decompose  it
via  a  sequence  of  equivalence  transformations  and  approximations
into  the  product  of  many  simple  expressions.  Since  each  of  these
simple  expressions  can  be  calculated  independently,  a  scoring
adjustment  can  be  made  (to  previous  path  scores)  for  each
additional  input  spectrum.  As  we  proceed,  many  of  the  more  salient 
approximations  will  be  pointed  out  and  justified. 

At  present  we  are  modeling  the  input  speech  spectra  by  the  LPC 
coefficients  plus  the  associated  gain. 

Let $I_i$, $i = 1, \ldots, N$, be the sequence of spectral frames modeling
the input, and let $P_k^j$, $k = 1, \ldots, M^j$, be the jth path of spectral
frames through the network, where the number of input frames is at least
as great as the number of frames in path j ($N \ge M^j$) for every j.
Also, let $\mathrm{Prob}^j$ be the probability of the jth path given the input:

$$\mathrm{Prob}^j = \mathrm{Prob}(P_k^j,\ k = 1, \ldots, M^j \mid I_i,\ i = 1, \ldots, N)$$

Specifying  any  unique  path  through  the  network  indirectly 
determines  a  unique  phoneme  sequence  (since  the  phonemes  are 
identified  by  labelled  nodes  along  the  path).  Currently,  since
only  a  single  path  directly  connects  any  two  phonemes  in  the 
network,  the  converse  is  also  true,  i.e.,  specifying  a  unique 
diphone  sequence  determines  a  unique  network  path.  This  second 
part will not, in general, be true when additional paths are added.

Noting that our recognition goal is to determine the most
probable path, we now consider how to score a single path. For
notational convenience we shall remove the superscript used to
identify which of the paths through the network is being scored.
We begin the decomposition of this probability with the following
equivalence relationship, using Bayes' Rule:


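$$\mathrm{Prob}(P_1 P_2 \ldots P_M \mid I_1 I_2 \ldots I_N) = \frac{\mathrm{Prob}(P_1 P_2 \ldots P_M)\,\mathrm{Prob}(I_1 I_2 \ldots I_N \mid P_1 P_2 \ldots P_M)}{\mathrm{Prob}(I_1 I_2 \ldots I_N)} \qquad (1)$$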

The third term of this expression, $\mathrm{Prob}(I_1 I_2 \ldots I_N)$, is
independent of the particular path through the network and
therefore does not need to be evaluated in order to correctly find
the most probable path. Note, however, that for incomplete path
scores to be meaningfully compared, each path should span the same
input. Having made this observation, we will not deal with the
evaluation of $\mathrm{Prob}(I_1 I_2 \ldots I_N)$ here.

The first term, $\mathrm{Prob}(P_1 P_2 \ldots P_M)$, is the a priori probability
of the particular path being considered. Since the path actually
is composed of a sequence of diphone models and each diphone model
is itself defined by a sequence of spectral frames, this
probability is exactly the same as the a priori probability of the
particular diphone sequence. The approximation made in the current
implementation is to assume that the presence of each diphone in
the sequence is independent of the other diphones, as long as the
implied context is satisfied. For example, diphones whose right
phoneme is X can only be followed by diphones whose left phoneme is
X.
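The independence approximation and the context constraint can be sketched as follows. This is a hedged illustration only: the function name, the representation of a diphone as a (left, right) pair, and the table of log priors are all hypothetical, since the report does not give the a priori values.

```python
import math


def diphone_sequence_log_prior(diphones, log_prior):
    """Sum independent diphone log priors, rejecting context violations.

    diphones  : list of (left_phoneme, right_phoneme) pairs
    log_prior : dict mapping each diphone to its log a priori probability
                (hypothetical values, not taken from the report)
    """
    for (_, right), (left, _) in zip(diphones, diphones[1:]):
        if right != left:
            return -math.inf  # implied context not satisfied
    return sum(log_prior[d] for d in diphones)
```

A sequence such as [S EH][EH N] is scored as the sum of the two log priors, whereas [S EH][N S] violates the context constraint and receives a log prior of minus infinity.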

The most important (and interesting) part of the path score
will result from the second term, $\mathrm{Prob}(I_1 I_2 \ldots I_N \mid P_1 P_2 \ldots P_M)$.
This term tells how probable it is that the observed input would
have been produced if it were known to have been produced as a
result of "speaking" along the specified path.

Remember that $N \ge M$, since each of the spectral frames in the
path must have at least one corresponding spectral frame in the
input. Certain precautions have been taken to ensure that this is
a reasonable constraint. For example, a variable frame rate
algorithm is used by the network compiler to ensure that each
diphone model consists of a sequence of substantially different
spectral frames.

The  next  step  in  the  decomposition  of  (1)  consists  of  relating 
the  evaluation  of  the  important  second  term  to  alignment 
considerations.  An  alignment  is  the  correspondence  between  the 
frames in the input and the frames in the path being scored.

$$\mathrm{Prob}(I_1 I_2 \ldots I_N \mid P_1 P_2 \ldots P_M) = \sum_{i=1}^{\mathrm{Num}} \mathrm{Prob}(\mathrm{Alignment}_i)\,\mathrm{Prob}(I_1 I_2 \ldots I_N \mid P_1 P_2 \ldots P_M,\ \mathrm{Alignment}_i) \qquad (2)$$

where Num is the number of different possible alignments.


What  this  means  is  that  an  exact  evaluation  of  (2)  requires 
knowledge  of  all  possible  ways  in  which  to  align  the  given  path 
with  the  observed  input.  At  this  point  we  detail  a  few  of  the 
assumptions  which  permit  a  further  evaluation  of  this  probability. 
The  first  assumption  is  that  the  time  ordering  of  spectral  frames 
of  both  diphone  models  and  input  speech  is  significant.  A  second 
assumption  is  that  in  defining  any  particular  alignment  the  only 
relevant  thing  that  can  be  said  is  how  many  input  frames  are  to  be 
associated  with  each  of  the  spectral  frames  in  the  path.  Since  we 
have  already  asserted  that  each  of  the  spectral  frames  in  the  path 
must  have  at  least  one  corresponding  frame  in  the  input,  the 
alignment can be completely defined by a sequence of M durations,
$D_1 \ldots D_M$, one duration for each of the spectral frames in the
path.


Since each alignment will be characterized by a different set
of durations, we only need to consider the evaluation of the
probability for an arbitrary alignment. Note that the number of
input frames is equal to the sum of these durations
($N = \sum_{k=1}^{M} D_k$), since each spectral frame in the input is
aligned with one (and only one) of the spectral frames in the path.
Let us consider the evaluation of an arbitrary alignment.
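To make the size of the sum in (2) concrete, the alignments can be enumerated as duration tuples. The Python sketch below is illustrative only (the function name is an assumption). It relies on the standard combinatorial fact, not stated in the report, that the number of ways to write N as an ordered sum of M positive integers is C(N-1, M-1).

```python
from math import comb


def alignments(n_input, n_path):
    """Yield every duration tuple (D_1..D_M) with D_k >= 1 and sum == N."""
    if n_path < 1 or n_input < n_path:
        return  # no valid alignment exists
    if n_path == 1:
        yield (n_input,)
        return
    # The first path frame takes d input frames, leaving at least one
    # input frame for each of the remaining n_path - 1 path frames.
    for d in range(1, n_input - n_path + 2):
        for rest in alignments(n_input - d, n_path - 1):
            yield (d,) + rest


all_alignments = list(alignments(5, 3))
print(len(all_alignments), comb(5 - 1, 3 - 1))  # 6 6
```

Because this count grows rapidly with N and M, exhaustive enumeration is practical only for very short paths.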

$$\mathrm{Prob}(\mathrm{Alignment}) = \mathrm{Prob}(D_1 D_2 \ldots D_M)$$

This is the joint probability that $P_1$ corresponds to $D_1$ frames
of input, $P_2$ corresponds to $D_2$ frames of input, ..., and that
$P_M$ corresponds to $D_M$ frames of input. Note that this alignment
probability is independent of any particular sequence of input
spectral frames. We approximate $\mathrm{Prob}(D_1 D_2 \ldots D_M)$ as
$\mathrm{Prob}(D_1)\,\mathrm{Prob}(D_2) \cdots \mathrm{Prob}(D_M)$. This is
equivalent to assuming that the alignment of each frame in the path
is independent of the alignment of every other frame in the path.
Having selected $\mathrm{Alignment}(D_1 \ldots D_M)$, and remembering that the
alignment of each path frame has been assumed to be independent of
all other path frames, we note that:

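$$
\begin{aligned}
&\mathrm{Prob}(I_1 I_2 \ldots I_N \mid P_1 P_2 \ldots P_M,\ \mathrm{Alignment}(D_1 D_2 \ldots D_M)) = \\
&\qquad \mathrm{Prob}(I_1 \ldots I_{D_1} \mid P_1 \ldots) \\
&\qquad \cdot\ \mathrm{Prob}(I_{D_1+1} \ldots I_{D_1+D_2} \mid \ldots P_2 \ldots I_1 \ldots) \\
&\qquad \cdots \\
&\qquad \cdot\ \mathrm{Prob}(I_{D_1+D_2+\cdots+D_{M-1}+1} \ldots I_{D_1+D_2+\cdots+D_{M-1}+D_M} \mid \ldots P_M \ldots I_{D_1+\cdots+D_{M-1}})
\end{aligned}
\qquad (3)
$$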

In writing (3) the correspondence between each path frame and its
corresponding input frame(s) is implied by notation. We now assume
that the probability of every input frame depends only on the
particular path frame to which it is aligned. This results in the
following simplification:

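$$
\begin{aligned}
&\mathrm{Prob}(I_1 I_2 \ldots I_N \mid P_1 P_2 \ldots P_M,\ \mathrm{Alignment}(D_1 D_2 \ldots D_M)) = \\
&\qquad \mathrm{Prob}(I_1 \ldots I_{D_1} \mid P_1) \\
&\qquad \cdot\ \mathrm{Prob}(I_{D_1+1} \ldots I_{D_1+D_2} \mid P_2) \\
&\qquad \cdots \\
&\qquad \cdot\ \mathrm{Prob}(I_{D_1+D_2+\cdots+D_{M-1}+1} \ldots I_{D_1+D_2+\cdots+D_{M-1}+D_M} \mid P_M)
\end{aligned}
\qquad (4)
$$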

We  now  make  the  additional  assumption  that  the  probability  of  any 
particular  input  frame,  given  a  corresponding  path  frame,  is 
independent  of  other  input  frames.  The  right-hand  terms  in  (4)  can 
then  be  further  expanded  as: 

$$
\begin{aligned}
&\mathrm{Prob}(I_1 I_2 \ldots I_N \mid P_1 P_2 \ldots P_M,\ \mathrm{Alignment}(D_1 D_2 \ldots D_M)) = \\
&\qquad \mathrm{Prob}(I_1 \mid P_1) \cdots \mathrm{Prob}(I_{D_1} \mid P_1) \\
&\qquad \cdot\ \mathrm{Prob}(I_{D_1+1} \mid P_2) \cdots \mathrm{Prob}(I_{D_1+D_2} \mid P_2) \\
&\qquad \cdots \\
&\qquad \cdot\ \mathrm{Prob}(I_{D_1+D_2+\cdots+D_{M-1}+1} \mid P_M) \cdots \mathrm{Prob}(I_{D_1+D_2+\cdots+D_{M-1}+D_M} \mid P_M)
\end{aligned}
\qquad (5)
$$

Notice that the last referenced input frame,
$I_{D_1+D_2+\cdots+D_{M-1}+D_M}$, is frame $I_N$. Although the scoring philosophy
and a few assumptions and approximations have permitted us to
proceed this far, we can see that several important issues still
remain to be resolved before a scoring algorithm can be
implemented.
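As a minimal sketch of how the factorization in (5) turns into an incremental score, the following Python function accumulates log Prob(I|P) over one alignment. The function and parameter names are assumptions made for illustration; the per-frame log probability is passed in as a function because its form is discussed separately.

```python
def path_log_score(inputs, path, durations, frame_log_prob):
    """Log of the product in (5) for one alignment.

    inputs         : list of N input frames
    path           : list of M path frames
    durations      : list of M positive integers summing to N
    frame_log_prob : function (input_frame, path_frame) -> log Prob(I|P)
    """
    if len(durations) != len(path) or sum(durations) != len(inputs):
        raise ValueError("durations must cover every input frame exactly once")
    if any(d < 1 for d in durations):
        raise ValueError("each path frame needs at least one input frame")
    score, start = 0.0, 0
    for p, d in zip(path, durations):
        # Input frames start .. start+d-1 are aligned with path frame p.
        for frame in inputs[start:start + d]:
            score += frame_log_prob(frame, p)
        start += d
    return score
```

Each input frame contributes exactly one term, so the score of a partial path can be extended one input frame at a time.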

The first of these issues involves the evaluation of the
probabilities indicated in (5). All of the probabilities are of
the same basic form: $\mathrm{Prob}(I \mid P)$. The probability of a particular
input  spectral  frame  must  be  calculated  given  the  particular  path 
spectral  frame  of  which  it  is  an  instance.  Currently  the  weighted 
Euclidean  distance  between  the  two  spectral  frames  is  used  as  a 
scoring  metric  in  lieu  of  the  log  probability.  This  is  equivalent 
to  assuming  that  every  path  frame  has  a  Gaussian  distribution  of 
input  spectra.  The  mean  of  each  such  distribution  is  the  path 
spectral  sequence  itself.  A  generalized  implementation  of  the 
scoring  philosophy  is  discussed  in  Section  3.2.2. 
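The link between a weighted Euclidean distance and a Gaussian log probability can be checked numerically. The sketch below assumes a diagonal covariance, a squared distance, and weights equal to the reciprocal variances; these are illustrative assumptions, and the report does not state which weights the implementation uses.

```python
import math


def weighted_sq_distance(x, mean, weights):
    """Weighted squared Euclidean distance between two frames."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x, mean, weights))


def gaussian_log_prob(x, mean, variances):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (a - m) ** 2 / (2 * v)
        for a, m, v in zip(x, mean, variances)
    )


x, mean, variances = [1.0, 2.0], [0.5, 2.5], [0.5, 2.0]
weights = [1.0 / v for v in variances]
const = sum(-0.5 * math.log(2 * math.pi * v) for v in variances)
# log Prob(I|P) = const - 0.5 * (weighted squared distance)
print(gaussian_log_prob(x, mean, variances),
      const - 0.5 * weighted_sq_distance(x, mean, weights))
```

Under these assumptions the constant term is the same for every input frame scored against a given path frame, so ranking by distance and ranking by log probability agree for that path frame.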

A  second  implementation  issue  concerns  the  alignment 
probabilities.  Notice  that  the  scoring  philosophy  (as  expanded) 
only  indicated  how  a  known  duration  probability  would  affect  the 
score  of  an  entire  path.  A  decision  has  to  be  made  as  to  what  (if 
any)  score  effect  should  be  applied  during  the  matching  process, 
before  the  final  duration  is  known.  Since  what  is  known  is  that 
the  duration  of  the  current  path  spectral  frame  is  (for  example)  at 
least j frames long, the score is modified so as to include the
expected value of the duration probability.

$$\text{Expected value} = \sum_{i=j}^{\mathrm{Max}} \mathrm{Prob}(D(i))\,\mathrm{Prob}(D(i))$$

where $\mathrm{Prob}(D(i))$ is the probability that the duration of this path
frame is i input frames long.
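The expected-value expression can be evaluated directly from a table of duration probabilities. The sketch below implements the sum exactly as printed; the function name, the dictionary representation, and the choice of Max as the largest duration in the table are assumptions.

```python
def expected_duration_score(duration_prob, j):
    """Evaluate the sum over i = j..Max of Prob(D(i)) * Prob(D(i)).

    duration_prob : dict {duration i: Prob(D(i))}. Max is taken to be the
                    largest duration in the table (an assumption).
    j             : number of input frames already aligned with this frame
    """
    max_duration = max(duration_prob)
    return sum(duration_prob.get(i, 0.0) ** 2
               for i in range(j, max_duration + 1))
```

The value can only decrease as j grows, since fewer candidate durations remain in the sum.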

3.2  Current  Status  -  Implementation 

The  software  work  involved  in  the  diphone  template  recognition 
project  consists  of  two  major  components.  First,  there  is  a 
diphone network compiler which takes a text file of diphone
descriptions as input and produces a diphone network which can be
used by the matcher program during recognition. The second
component  is  the  matcher,  which  uses  this  compiled  network  and 
produces  the  best  scoring  phoneme  sequence  given  a  sequence  of 
input  spectral  frames. 
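For a single path, the best-scoring alignment can be found by dynamic programming. The Python sketch below shows that standard technique for one path only, using a precomputed cost matrix; it is not the matcher described in this report, which searches the whole network and also accounts for duration and a priori terms. All names are illustrative.

```python
def best_alignment(cost):
    """Find the minimum-cost monotonic alignment of N inputs to M path frames.

    cost[k][i] is the cost (e.g. a spectral distance) of aligning input
    frame i with path frame k. Every path frame must receive at least one
    input frame, and frame order is preserved (requires N >= M).
    Returns (total_cost, durations).
    """
    m, n = len(cost), len(cost[0])
    inf = float("inf")
    best = [[inf] * n for _ in range(m)]
    stay = [[False] * n for _ in range(m)]  # True: same path frame as i-1
    best[0][0] = cost[0][0]
    for i in range(1, n):
        for k in range(min(i + 1, m)):
            same = best[k][i - 1]
            prev = best[k - 1][i - 1] if k > 0 else inf
            stay[k][i] = same <= prev
            best[k][i] = cost[k][i] + min(same, prev)
    # Trace back from the last path frame at the last input frame.
    durations, k = [0] * m, m - 1
    for i in range(n - 1, -1, -1):
        durations[k] += 1
        if i > 0 and not stay[k][i]:
            k -= 1
    return best[m - 1][n - 1], durations
```

The work grows as N times M, in contrast to the combinatorial number of alignments that an exhaustive evaluation would visit.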

3.2.1  Network  Compiler 

During this quarter an initial version of a network compiler
was implemented. It takes diphone definitions in the same format
as COMPOZ (the diphone compiler used by the synthesis program),
with one minor difference: diphones without context must appear
before those with context. This constraint permitted a considerably
simpler algorithm than would otherwise have been necessary.
Eventually, however, since we would like the capability to make
arbitrary incremental additions to the network, the compiler will
probably be extended. We have used this network compiler to
generate a complete network of all the diphones currently
available (approximately 2600).

3.2.2  Matcher 

A  matcher  that  satisfies  the  basic  considerations  in  the 
design  set  forth  in  the  last  QPR  [1]  has  been  implemented  and  is 
currently  running.  These  considerations  include:  1)  a  sound 
scoring  strategy,  2)  continuous  operation,  3)  alignment 
availability  for  training,  and  4)  efficiency.  Since  the  matcher 
appears  to  be  operating  as  it  should  (no  apparent  bugs),  our
attention  is  turning  to  a  generalization  of  the  currently 
implemented  spectral  scoring  procedure. 

The  most  general  scoring  procedure  would  require  a
nonparametric  model  of  the  probability  density  distribution.  For
example,  if  we  use  samples  of  input  frames  known  to  have  been 
aligned  with  a  particular  path  spectral  frame,  such  a  probability 
can  be  estimated.  (It  is  this  procedure  which  was  alluded  to 
earlier  when  we  indicated  that  the  memory  space  on  the  current 
computer  was  affecting  our  current  implementation.)  Of  course  the 
mere  collection  of  a  sufficient  number  of  instances  of  this  path 
spectral  frame  would  in  itself  be  a  significant  data  processing 
problem.  A  much  simpler  procedure  would  be  to  assume  that  a  single
parametric  model  can  accurately  represent  the  probability
distribution  of  each  path  spectral  frame  if  the  collected  instances 
are  used  to  determine  the  parameter  values.  A  still  simpler 
procedure  would  be  to  assume  that  the  uniform  parametric  model  is 
Gaussian.  The  parameters  in  this  case  consist  of  the  means  and 
covariance  matrix.  To  minimize  both  storage  requirements  and 
computation  time,  the  implemented  scoring  algorithm  adds  the 
logarithm  of  the  probabilities  instead  of  multiplying  the 
probabilities.  The  evaluation  of  the  probability  then  reduces 
primarily  to  the  determination  of  the  Euclidean  distance  between 
the  input  and  path  spectral  frames. 

3.3  Future  Work 

We  anticipate  that  the  following  list  of  activities  will 
require  most  of  our  time  during  the  remainder  of  this  project: 

1)  Debugging,  testing  and  evaluation  of  the  current  system 

2)  Forced  alignment  with  time  boundaries  specified 

3)  Completion  of  statistics  code 

4)  Generalization  of  spectral  scoring 

5)  Training 

Since  we  have  enough  code  implemented  to  permit  the  operation 
of  the  entire  diphone  template  recognition  system,  we  feel  that  it 
will be advantageous to get the best possible performance before
resorting to all of the statistically based techniques. The reason
for this belief is that many word matchers and word spotters have
been based entirely on a spectral matching technique (and dynamic
programming) and have gotten relatively good performance for up to
a few hundred words. While we don't expect quite the same level of
performance (we are using much shorter segments and some contextual
variation) we do expect that a reasonable level of performance
should be possible. Knowing this level will also be handy as a
reference during the testing and training program. In achieving this
level  of  performance  we  will  no  doubt  uncover  some  bugs,  which 
might  be  overlooked  if  we  were  to  go  immediately  to  a  more  complex 
system. 

Currently,  forced  alignment  between  input  speech  and  the 
diphone  network  can  only  be  specified  as  an  ordered  sequence  of 
diphones.  Although  this  may  be  sufficient  for  most  of  the  cases 
where  forced  alignment  is  necessary,  it  will  probably  not  be 
sufficient  for  every  situation.  It  may  therefore  be  necessary  to 
include  the  capability  to  specify  the  frame  numbers  in  the  input  as 
an  additional  alignment  constraint  in  these  special  cases.  If  this 
proves  necessary,  not  much  programming  will  be  required  since  it  is 
a  straightforward  addition  to  the  current  forced  alignment  code. 

Code  to  update  the  duration  statistics,  given  a  particular 
path  frame  and  a  new  sample  duration,  has  already  been  written.
What  remains  to  be  coded  is  an  interface  to  this  code  which  permits
a  user  (who  knows  what  the  "correct"  segmentation  and  labeling 
should  be)  to  identify  which  portions  (or  perhaps  all)  of  the 
algorithmically  aligned  input  should  be  used  for  statistically 
updating  duration  distributions.  It  may  be  necessary  for  the  user 
to  exclude  certain  portions  of  the  alignment  because  a  really 
suitable  path  does  not  currently  exist  in  the  network.  Permitting 
the  user  to  identify  some  portion  of  the  input  which  was  poorly 
matched  and  include  that  portion  in  the  network  as  a  new  diphone 
path  (or  portion  of  a  diphone  path)  will  probably  be  necessary  as 
well. 
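One simple way to hold duration statistics for a path frame is a histogram that is updated with each accepted sample. The class below is a hypothetical sketch of such a structure, offered only as an illustration; the report does not describe how the existing update code stores its statistics.

```python
from collections import Counter


class DurationStats:
    """Histogram of observed durations for one path frame."""

    def __init__(self):
        self.counts = Counter()

    def update(self, duration):
        """Record one new sample duration (in input frames)."""
        if duration < 1:
            raise ValueError("a path frame spans at least one input frame")
        self.counts[duration] += 1

    def prob(self, duration):
        """Relative-frequency estimate of Prob(D(duration))."""
        total = sum(self.counts.values())
        return self.counts[duration] / total if total else 0.0
```

A user-facing interface of the kind described above would then call update only for those aligned segments that the user has marked as correctly matched.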


Some work on the generalization of spectral scoring will
undoubtedly be done. How far the spectral scoring metrics can be
pushed depends, to a large degree, on the amount of speech which we
can reasonably expect to train the system on. Another possible
limitation has to do with the amount of memory necessary to
represent the (potentially) more accurate metrics. It is very easy
to conceive of metrics whose implementation would require
substantially more memory than that which is (currently)
addressable on a PDP-10. Unless an extension is made to its
address space (in a way which permits such large amounts of memory
to be easily accessed) these very general scoring techniques will
not be implemented.

Once the program bugs have been eliminated, a substantial part
of  the  remainder  of  the  project  will  consist  of  training  the  system 
on  input  speech.  Although  it  is  much  too  early  now  to  estimate 
what  level  of  performance  will  be  possible,  we  hope  to  be  able  to 
make  such  an  estimate  later  on  in  the  project.  We  propose  also  to 
present  a  record  of  the  improvement  in  performance  as  a  function  of 
time  (or  sentences  processed  etc.).  This  will  enable  us  to 
determine  more  accurately  just  how  much  training  will  be  necessary 
to  get  a  certain  level  of  performance  and  what  level  of  performance 
we  might  expect  to  achieve.

4.  MULTIRATE  CODING

4.1  Introduction 

During  this  quarter  we  have  concentrated  our  efforts  on
improving  the  quality  of  transform-coded  speech.  We  have  worked
mainly  on  the  full-band  16  kb/s  ATC  system,  with  the  knowledge  that 
our  improvements  will  be  applicable  to  both  the  9.6  kb/s  baseband 
coder  and  to  the  general  multirate  system.  Below,  we  discuss  the 
major  cause  of  quality  degradation  in  the  ATC  system:  quantization 
noise.  We  then  discuss  the  two  areas  in  which  we  were  able  to 
achieve  a  substantial  improvement  in  coder  performance: 
bit-allocation  and  optimum  quantization.  For  simplicity,  we  shall 
restrict  our  discussions  to  ATC  without  noise  shaping.  All  our 
results  are  applicable  to  the  case  with  noise  shaping  as  well. 

4.2  Quantization  Noise 

In  this  section  we  first  describe  briefly  the  quantization  of 
the  DCT.  We  then  discuss  the  effect  of  quantization  noise  on 
quality. In our implementation of ATC, we are quantizing the DCT
coefficients of the linear prediction residual with a
frequency-dependent step-size $D_i$ given by

$$D_i = D_0\,|A_i|, \qquad 1 \le i \le 128 \qquad (6)$$

where $D_0$ is an experimentally derived constant to be discussed
below, and $|A_i|$ is $|A(\omega_i)|$, the magnitude of the DFT of the linear
prediction inverse filter $A(z)$. Initially, we used uniform n-bit
quantizers with step-size $D_i$, for $0 \le n \le 12$. (In practice, n never
exceeds 10.) Thus, the quantization of the DCT components can be
represented by the following two steps:

$$t_i = [x_i / D_i] \qquad (7)$$

and

$$\hat{x}_i = t_i D_i \qquad (8)$$

where $t_i$ is a temporary variable, $[\cdot]$ denotes taking the nearest
integer value, $x_i$ is the ith DCT component of the residual, and $\hat{x}_i$
is its quantized value. Note that the integer variable $t_i$
indicates which quantization level is used; it is encoded into a
binary code and transmitted across the channel. Substituting (6)
into (7), we have

$$t_i = [(x_i / |A_i|) / D_0] \qquad (9)$$

In (9), the term in parentheses has the same energy and the same
spectral shape as the DCT of the speech signal itself. Thus, to a
good approximation, our method is equivalent to quantizing the DCT
coefficients of the speech signal with a fixed step-size D_0. The
process of quantizing the DCT of speech has been described in the
literature and is given the name of ATC. It should maximize the
output signal-to-quantization-noise ratio (SNR) for a given number
of bits. For the full-band 16 kb/s case, the quality of the output
speech we obtained initially was somewhat rough, due to excessive
granular noise and/or clipping.
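
As an illustration, the two quantization steps (7) and (8) can be
sketched in Python. This is only a sketch under stated assumptions:
the function names, the symmetric index range used to model an n-bit
quantizer, and the sample values are hypothetical and are not taken
from the coder itself. Inputs that fall outside the range of the
n-bit quantizer are clipped, which is the clipping error referred to
in this section.

```python
import math

def step_size(d0, a_mag):
    """Frequency-dependent step-size of (6): D_i = D_0 * |A_i|."""
    return d0 * a_mag

def quantize_uniform(x, step, n_bits):
    """Uniform n-bit quantization following (7) and (8).

    Returns (t, x_hat): the transmitted integer index and the
    reconstructed value. The index range -2^(n-1) .. 2^(n-1)-1 is an
    assumption made for illustration; the report does not state it.
    """
    if n_bits <= 0:
        return 0, 0.0                      # 0-bit allocation: nothing is sent
    t = math.floor(x / step + 0.5)         # nearest integer, as in (7)
    t_min = -(1 << (n_bits - 1))
    t_max = (1 << (n_bits - 1)) - 1
    t = max(t_min, min(t_max, t))          # clipping when |x| is too large
    return t, t * step                     # reconstruction, as in (8)

# Hypothetical values: D_0 = 0.5 and |A_i| = 2.0 give D_i = 1.0.
d_i = step_size(0.5, 2.0)
in_range = quantize_uniform(2.6, d_i, 3)   # granular error only
clipped = quantize_uniform(7.2, d_i, 3)    # index saturates at t_max
```

With these sample values the first input is reproduced to within half
a step, while the second is clipped to the largest available level,
so its error is much larger than the step-size.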

The problem arises when we choose the step-size D_0 for the
uniform quantizers. For reasonably small D_0, the granular
noise is negligible, but the clipping errors are quite frequent and
large, causing an appreciable decrease in SNR and a severe
degradation in the quality of the coded speech. As the step-size
D_0 is increased, the clipping problem is alleviated and the SNR
increases, but the granular noise becomes more audible (in the form
of roughness). For still larger D_0, the granular noise becomes
excessive and the SNR drops again. During the previous quarter, we
were able to reach a suitable compromise value for D_0, but some
roughness was still perceptible in the coded speech. During this
quarter, in an attempt to improve the quality of the coded speech,
we reexamined the bit-allocation process and found that the
spectral model of the signal must not be quantized with a 6-dB
step-size. The modified bit-allocation scheme is discussed in the
following section.

4.3  Bit-Allocation 


Recall from the previous QPR that the number of bits to be
used at each frequency is given by

    b_i = b_0 + \log_2(1/|H_i|),    1 \le i \le 128          (10)

where 1/|H_i| is 1/|H(w)|, the magnitude of the DFT of the spectral
model of the speech signal, and b_0 is the average number of bits
per sample for a given bit-rate. The spectral model of speech was
discussed in the previous QPR; briefly, it consists of two
components: a smooth (LPC) spectral envelope, and a model for the
harmonic structure of the spectrum (pitch). The fractional numbers
b_i obtained in (10) are quantized to become integers \hat{b}_i,
subject to the following constraints: (i) no negative bit-assignment
is allowed, and (ii) the sum of the integers \hat{b}_i must equal B,
the number of available bits per frame. This integerization process
is given by

    \hat{b}_i = \max\{ 0, [ b_i + \delta ] \}                (11)

such that

    \sum_{i=1}^{128} \hat{b}_i = B                           (12)

The adjustment constant \delta in (11) is varied iteratively until
(12) is satisfied. The process depicted in (11) is in fact
quantization of the spectral model of the speech signal on a
logarithmic scale. To see that, we rewrite (10) as

    b_i = b_0 + \frac{1}{2} \log_2(1/|H_i|^2)

or,

    b_i = b_0 + \frac{1}{6.02} 10 \log_{10}(1/|H_i|^2)       (13)

In (13), we recognize the familiar term 10 \log_{10}(1/|H_i|^2),
which is the spectral model of the speech signal expressed in
decibels. Thus, from (13) and (11), it becomes clear that, to obtain
the allocated bits \hat{b}_i, we are quantizing the spectral model
with a 6.02 dB step-size (and an offset equal to b_0 + \delta). Note
that this uniform quantization process begins with a division as
shown in (13), while the rounding-off to integers is shown in (11).
The 6-dB step-size interpretation of the bit-allocation scheme has
been mentioned in the literature. We shall now denote this step-size
by S, and (13) can be rewritten as

    b_i = b_0 + 10 \log_{10}(1/|H_i|^2) / S                  (14)

To alleviate the clipping/granular noise tradeoff problem
discussed in Section 4.2, we decided to change the 6-dB step-size
shown in (13). In particular, one can show that for a step-size S
smaller than 6 dB, clipping is less likely to occur in the n-bit
quantizers, for the same quantization step-size D_0. Thus, one can
choose a small value of D_0 to minimize the amount of granular noise
and, at the same time, adjust S such that the likelihood of
clipping is also minimized. We hasten to add that it is not
always possible to find a suitable value of S for an arbitrarily
small D_0. Also, for S << 6 dB, a greater percentage of DCT
components are quantized with zero bits. The 0-bit allocation is a
problem in ATC; we shall discuss it in Section 4.5.

Informal listening over a set of 10 sentences showed that the
quality of the coded speech was much improved with a value of S of
either 5 or 4 dB, relative to the case where S=6 dB. In fact, this
improvement in quality was also accompanied by an average increase
of 2 dB in the output segmental SNR.
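
The bit-allocation procedure of (11), (12) and (14) can be sketched
as follows. This is a minimal sketch, not the coder's actual
routine: the report says only that the adjustment constant is
"varied iteratively", so the bisection search, the function name
`allocate_bits` and the four-point spectral model in the example are
assumptions made for illustration. Because b_0 and the adjustment
constant enter (11) only through their sum, the sketch searches for
that single combined offset.

```python
import math

def allocate_bits(h_mag, total_bits, step_db=6.02, iterations=60):
    """Integer bit allocation in the spirit of (11), (12) and (14).

    h_mag[i] is |H_i|, so 10*log10(1/|H_i|^2) is the spectral model
    in dB. The bisection on the offset is an assumed search method.
    """
    model_db = [10.0 * math.log10(1.0 / (h * h)) for h in h_mag]
    frac = [m / step_db for m in model_db]      # model divided by S, as in (14)

    def assign(offset):
        # (11): add the offset, round to nearest integer, forbid negatives
        return [max(0, math.floor(f + offset + 0.5)) for f in frac]

    lo = -max(frac) - 1.0                       # every component gets 0 bits
    hi = -min(frac) + total_bits                # every component gets >= B bits
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(assign(mid)) <= total_bits:      # total is nondecreasing in offset
            lo = mid
        else:
            hi = mid
    # Largest offset whose total does not exceed B; the total equals B
    # unless several components change their bit count at the same offset.
    return assign(lo)

# Hypothetical 4-point spectral model and a budget of B = 6 bits.
h = [0.25, 0.5, 1.0, 2.0]
bits_6db = allocate_bits(h, 6, step_db=6.02)
bits_4db = allocate_bits(h, 6, step_db=4.0)
```

In this small example the smaller step-size moves bits toward the
strongest component and leaves one more component with zero bits,
which is the behavior noted above for S smaller than 6 dB.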

4.4  Optimum  Quantization 

The second way in which we were able to improve the SNR of the
coded speech was to replace the uniform quantizers by more nearly
optimal ones. To that end, we collected the statistics of the inputs
to the n-bit quantizers. We found those inputs to have a Gaussian pdf
(probability density function or histogram), as has been reported
by others. Thus, to maximize the SNR, we decided to use Max's
optimum non-uniform quantizers designed to minimize the
mean-squared quantization error for Gaussian inputs [2]. For the
same set of 10 sentences, the non-uniform optimum quantizers
yielded an average increase of 1.2 dB in SNR over the case with
uniform quantizers, for S=4 or 5 dB. However, perceptually, it was
difficult to detect any change in the quality of the coded speech.

Finally, we have recently realized that the SNR of non-uniform
quantizers optimized for a Gaussian pdf does not increase at the
rate of 6 dB per bit. The values of the SNR of the n-bit Max
quantizers for 0 ≤ n ≤ 5 are shown in Table I.

Table I: Signal-to-noise ratio SNR(n) as a function of the number
of bits for Max's optimum non-uniform quantizers. ΔSNR is the first
forward difference.

From that table, it is clear that the increase in SNR approaches 6
dB for each 1-bit increase only for large n. In ATC, large values
of n are seldom used: the quantizers most often used (95% of the
time) are for 0 ≤ n ≤ 3. Thus, Table I not only explains the
improvement in performance for S = 4 or 5 dB, but also suggests the
use of a quantization scheme in (14) with unequal steps. We are
now in the process of testing the newly modified bit-allocation
scheme.
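
The dependence of SNR on the number of bits can be checked
numerically with the sketch below, which designs a minimum
mean-squared-error quantizer for a unit-variance Gaussian input by
the usual alternation between midpoint decision thresholds and
centroid output levels. It is an illustration only: the function
names, the starting levels and the iteration count are assumptions,
and the values it produces are computed here rather than copied from
Table I.

```python
import math

def pdf(x):
    """Unit-variance Gaussian density (taken as zero at +/- infinity)."""
    if math.isinf(x):
        return 0.0
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Unit-variance Gaussian cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_quantizer_snr_db(n_bits, iterations=1000):
    """SNR in dB of an n-bit minimum-MSE quantizer for Gaussian input."""
    levels = 2 ** n_bits
    if levels == 1:
        return 0.0                  # 0 bits: output is the mean, noise = signal
    # Assumed starting point: levels spread uniformly over [-2, 2].
    y = [-2.0 + 4.0 * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iterations):
        mids = [0.5 * (a + b) for a, b in zip(y, y[1:])]
        edges = [-math.inf] + mids + [math.inf]
        probs = [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]
        # Centroid of each interval: (pdf(a) - pdf(b)) / P(a < x < b)
        y = [(pdf(a) - pdf(b)) / p
             for a, b, p in zip(edges, edges[1:], probs)]
    # With centroid outputs, MSE = E[x^2] - sum(P_k * y_k^2), and E[x^2] = 1.
    mse = 1.0 - sum(p * v * v for p, v in zip(probs, y))
    return -10.0 * math.log10(mse)

snr = [max_quantizer_snr_db(n) for n in range(6)]
delta_snr = [b - a for a, b in zip(snr, snr[1:])]   # first forward difference
```

For the 1-bit case the result can be verified by hand: the two output
levels are +/- sqrt(2/pi), the mean-squared error is 1 - 2/pi, and the
SNR is therefore about 4.4 dB rather than 6 dB.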

4.5  Conclusions 

At present, the speech quality for the full-band coder at 16
kb/s is much improved relative to what it was at the beginning of
this quarter. The quality of the coded speech is quite close to
that of the original. Quantization noise is almost never
perceived. However, one can perceive very low-level clicks in the
coded speech. We were able to determine that the cause of such
clicks is related to the allocation of zero bits to certain DCT
components. We performed an experiment where, following the usual
bit-allocation with S=5 dB, we allowed the use of a 1-bit quantizer
for all samples that would normally be quantized with zero bits.
Informal listening tests showed that all clicks disappeared from
the coded speech. However, this solution increases the bit-rate.
We are now seeking solutions to this problem that do not increase
the bit-rate. We are also in the process of testing the modified
bit-allocation scheme, which employs non-uniform quantization of
the logarithm of the spectral model. Finally, we are testing the
performance of the multirate coder, at 9.6 kb/s or below, with the
improvements discussed in this report.
APPENDIX A  DIPHONE CATEGORIES

This  Appendix  contains  a  description  of  the  eight  major 
categories  of  diphones.  Within  each  category  is  a  definition 
of  the  set  of  all  possible  diphones  of  that  type,  and  the 
number  of  total  diphones  in  that  category.  In  order  for  the 
reader  to  understand  the  notation  that  is  used,  we  will  examine 
the  first  category  in  some  detail,  and  then  note  other  notation 
in  the  remaining  categories.  The  listing  of  the  categories 
begins  at  the  conclusion  of  these  explanatory  paragraphs. 

The first category is entitled C1V1 (for Consonant-Vowel)
diphones. The number 597 in parentheses indicates that there
are 597 total diphones in this category. Below the category
name are the sets of possible phonemes that can be substituted
for C1 and V1. C1 can be any one of the 25 consonants listed,
and V1 can be any one of the 24 vowels. To generate the
possible C1V1 diphones, we begin by letting [P] be substituted
for C1, and let each of the 24 vowels be substituted in for V1.
This generates the diphones [P IY], [P IH], [P EY], [P EH], and
so on up to [P AXR]. We then let C1 be [T] and begin again,
which produces [T IY], [T IH], and so on to [T AXR]. Below the
definitions of C1 and V1 are three Exceptions, meaning that
while these diphones would be generated by the above procedure,
they are not in the data base. In this case, these three
diphones do not occur in English, i.e., these phoneme
combinations do not occur. The total number of diphones can be
easily calculated by multiplying the number of possible C1
phonemes by the number of V1 phonemes and subtracting the
Exceptions. In this case, 25*24-3=597.
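
The listing procedure just described can be sketched in a few lines
of Python. The function name `enumerate_diphones` and the
five-consonant subset are hypothetical and chosen only to keep the
example short; the vowel set and the three Exceptions are those of
the C1V1 category.

```python
from itertools import product

def enumerate_diphones(first, second, exceptions=()):
    """List every [first second] pair in listing order, minus exceptions."""
    skip = set(exceptions)
    return [(a, b) for a, b in product(first, second) if (a, b) not in skip]

# The 24 vowels of V1, in the order used by the listing.
V1 = ("IY IH EY EH AE AA AO AH OW UH UW ER "
      "AY1 OY1 AW YU IR OR AR EYR UR AX IX AXR").split()

# Illustrative subset of C1 (assumption: not the full consonant set).
C1_SUBSET = ["P", "T", "W", "Y", "R"]

EXCEPTIONS = [("W", "YU"), ("Y", "YU"), ("R", "YU")]

diphones = enumerate_diphones(C1_SUBSET, V1, EXCEPTIONS)

# Same arithmetic as in the text, for the full category.
full_category_count = 25 * len(V1) - len(EXCEPTIONS)
```

The list begins with [P IY] and runs to [P AXR] before starting again
at [T IY], and the count for the full category is obtained by the
same multiply-and-subtract rule.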

The category V3V1 diphones contains the first example of
Additions. Additions are diphones which fit into the general
category (e.g., Vowel-Vowel), but which are not captured by the
listing procedure.

In the Stop Consonant Diphones with Context category, we
describe the first diphones which contain the surrounding
phonetic context in their specification. At this point we also
introduce a new phoneme variable X, which is defined as the set
of eight stop consonant phonemes. As described in earlier
reports, diphones in context are indicated by the following
notation:

    Pa Pb / Ca & Cb



where Pa Pb gives the two phonemes defining the diphone, the "/"
means "in the context of", and the "&" indicates the location
of the diphone Pa Pb with preceding context Ca and following
context Cb. Each of these contexts can, in general, be up to 3
phonemes in duration and include single phonemes, or classes of
phonemes grouped in square brackets.
C1V1 DIPHONES (597)

C1 = P T K B D G CH JH F V TH DH S Z SH ZH W Y R L M N HH NX

V1 = IY IH EY EH AE AA AO AH OW UH UW ER
AY1 OY1 AW YU IR OR AR EYR UR AX IX AXR

Exceptions: [W YU] [Y YU] [R YU]


V2C2 DIPHONES (597)

V2 = V1, except [AY1] is replaced by [AY2]
and [OY1] is replaced by [OY2]

C2 = SIP SIT SIK SIB SID SIG SIC SIJ F V TH DH
S Z SH ZH W Y R L M N HH NX EL

Exceptions: [YU W] [YU Y] [YU R]


V3V1 DIPHONES (242)

V3 = AA AO AW AY2 IY ER EY OW OY2 UW
V1 = as defined above

Additions: [AY1 AY2] [OY1 OY2]


C3C4 DIPHONES (483)

C3 = P T K B D G CH JH F V TH DH S Z SH ZH R L M N NX

C4 = SIP SIT SIK SIB SID SIG SIC SIJ F V TH DH
S Z SH ZH W Y R L M N HH


Stop Consonant Diphones with Context (544)

SIX X / & V1      V1 = defined above               (192)
SIX X / & C4      C4 = defined above               (184)
SIX X / C3 &      C3 = defined above               (168)

X = P T K B D G CH JH


Diphones with Silence (81)

- V4              V4 = V1 less [AX] [IX] [AXR]     (21)
- C4              C4 = defined above               (23)
V3 -              V3 = defined above               (10)
C5 -              C5 = C3 less [R] [L]             (19)
SIX X / &         X = defined above                (8)


[AY] and [OY] Vowels with Context (98)

AY1 AY2 / & V1    V1 = defined above               (24)
AY1 AY2 / & C2    C2 = defined above               (25)
OY1 OY2 / & V1    V1 = defined above               (24)
OY1 OY2 / & C2    C2 = defined above               (25)


Miscellaneous Diphones (10)

SIK K / S & C6    C6 = L R W                       (3)
SIP P / S & C7    C7 = L R                         (2)
SIT T / S & R                                      (1)
- Q                                                (1)
Q AR                                               (1)
IH DX                                              (1)
DX EY                                              (1)
APPENDIX B  ALPHABETIZED LIST OF DIPHONES


- AA

- AE

- AH

- AO

- AR

- AW

- AY1

- DH

- EH

- ER

- EY

- EYR

- F

- HH

- IH

- IR

- IY

- L

- M

- N

- OR

- OW

- OY1

- Q

- R

- S

- SH

- SIB

- SIC

- SID

- SIG

- SIJ

- SIK

- SIP

- SIT

- TH

- UH

- UR

- UW

- V

- W

- Y

- YU

- Z

- ZH

AA -

AA AA

AA AE

AA AH

AA AO

AA AR

AA AW

AA AX

AA AXR

AA AY1

AA DH

AA EH

AA EL

AA ER

AA EY

AA EYR

AA F

AA HH

AA IH

AA IR

AA IX

AA IY

AA L

AA M

AA N

AA NX

AA OR

AA OW

AA OY1

AA R

AA S

AA SH

AA SIB

AA SIC

AA SID

AA SIG

AA SIJ

AA SIK

AA SIP

AA SIT

AA TH

AA UH

AA UR

AA UW

AA V

AA W

AA Y

AA YU

AA Z

AA ZH

AE DH

AE EL

AE F

AE HH

AE L

AE M

AE N

AE NX

AE R

AE S

AE SH

AE SIB

AE SIC

AE SID

AE SIG

AE SIJ

AE SIK

AE SIP

AE SIT

AE TH

AE V

AE W

AE Y

AE Z

AE ZH

AH DH

AH EL

AH F

AH HH

AH L

AH M

AH N

AH NX

AH R

AH S

AH SH

AH SIB

AH SIC

AH SID

AH SIG

AH SIJ

AH SIK

AH SIP

AH SIT

AH TH

AH V

AH W

AH Y

AH Z

AH ZH

AO -

AO AA

AO AE

AO AH

AO AO

AO AR

AO AW

AO AX

AO AXR

AO AY1

AO DH

AO EH

AO EL

AO ER

AO EY

AO EYR

AO F

AO HH

AO IH

AO IR

AO IX

AO IY

AO L

AO M

AO N

AO NX

AO OR

AO OW

AO OY1

AO R

AO S

AO SH

AO SIB

AO SIC

AO SID

AO SIG

AO SIJ

AO SIK

AO SIP

AO SIT

AO TH

AO UH

AO UR

AO UW

AO V

AO W

AO Y

AO YU

AO Z

AO ZH

AR DH

AR EL

AR F

AR HH

AR L

AR M

AR N

AR NX

AR R

AR S

AR SH

AR SIB

AR SIC

AR SID

AR SIG

AR SIJ

AR SIK

AR SIP

AR SIT

AR TH

AR V

AR W

AR Y

AR Z

AR ZH

AW -

AW AA

AW AE

AW AH

AW AO

AW AR

AW AW

AW AX

AW AXR

AW AY1

AW DH

AW EH

AW EL

AW ER

AW EY

AW EYR

AW F

AW HH

AW IH

AW IR

AW IX

AW IY

AW L

AW M

AW N

AW NX

AW OR

AW OW

AW OY1

AW R

AW S

AW SH

AW SIB

AW SIC

AW SID

AW SIG

AW SIJ

AW SIK

AW SIP

AW SIT

AW TH

AW UH

AW UR

AW UW
AW V

AW W

AW Y

AW YU

AW Z

AW ZH

AX DH

AX EL

AX F

AX HH

AX L

AX M

AX N

AX NX

AX R

AX S

AX SH

AX SIB

AX SIC

AX SID

AX SIG

AX SIJ

AX SIK

AX SIP

AX SIT

AX TH

AX V

AX W

AX Y

AX Z

AX ZH

AXR DH

AXR EL

AXR F

AXR HH

AXR L

AXR M

AXR N

AXR NX

AXR R

AXR S

AXR SH

AXR SIB

AXR SIC

AXR SID

AXR SIG

AXR SIJ

AXR SIK

AXR SIP

AXR SIT

AXR TH

AXR V

AXR W

AXR Y

AXR Z

AXR ZH

AY1 AY2

AY1 AY2 / & AA

AY1 AY2 / & AE

AY1 AY2 / & AH

AY1 AY2 / & AO

AY1 AY2 / & AR

AY1 AY2 / & AW

AY1 AY2 / & AX

AY1 AY2 / & AXR

AY1 AY2 / & AY1

AY1 AY2 / & DH

AY1 AY2 / & EH

AY1 AY2 / & EL

AY1 AY2 / & ER

AY1 AY2 / & EY

AY1 AY2 / & EYR

AY1 AY2 / & F

AY1 AY2 / & HH

AY1 AY2 / & IH

AY1 AY2 / & IR

AY1 AY2 / & IX

AY1 AY2 / & IY

AY1 AY2 / & L

AY1 AY2 / & M

AY1 AY2 / & N

AY1 AY2 / & NX

AY1 AY2 / & OR

AY1 AY2 / & OW

AY1 AY2 / & OY1

AY1 AY2 / & R

AY1 AY2 / & S

AY1 AY2 / & SH

AY1 AY2 / & SIB

AY1 AY2 / & SIC

AY1 AY2 / & SID

AY1 AY2 / & SIG

AY1 AY2 / & SIJ

AY1 AY2 / & SIK

AY1 AY2 / & SIP

AY1 AY2 / & SIT

AY1 AY2 / & TH

AY1 AY2 / & UH

AY1 AY2 / & UR

AY1 AY2 / & UW

AY1 AY2 / & V

AY1 AY2 / & W

AY1 AY2 / & Y

AY1 AY2 / & YU

AY1 AY2 / & Z

AY1 AY2 / & ZH

AY2 -

AY2 AA

AY2 AE

AY2 AH

AY2 AO

AY2 AR

AY2 AW

AY2 AX

AY2 AXR

AY2 AY1

AY2 DH

AY2 EH

AY2 EL

AY2 ER

AY2 EY

AY2 EYR

AY2 F

AY2 HH

AY2 IH

AY2 IR

AY2 IX

AY2 IY

AY2 L

AY2 M

AY2 N

AY2 NX

AY2 OR

AY2 OW

AY2 OY1

AY2 R

AY2 S

AY2 SH

AY2 SIB

AY2 SIC

AY2 SID

AY2 SIG

AY2 SIJ

AY2 SIK

AY2 SIP

AY2 SIT

AY2 TH

AY2 UH

AY2 UR

AY2 UW

AY2 V

AY2 W

AY2 Y

AY2 YU

AY2 Z

AY2 ZH

B -

B AA

B AE

B AH

B AO

B AR

B AW

B AX

B AXR

B AY1

B DH

B EH

B ER

B EY

B EYR

B F

B HH

B IH

B IR

B IX

B IY

B L

B M

B N

B OR

B OW

B OY1

B R

B S

B SH

B SIB

B SIC

B SID

B SIG

B SIJ

B SIK

B SIP

B SIT

B TH

B UH

B UR

B UW

B V

B W

B Y

B YU

B Z

B ZH

CH -

CH AA

CH AE

CH AH

CH AO

CH AR

CH AW

CH AX

CH AXR

CH AY1

CH DH

CH EH

CH ER

CH EY

CH EYR

CH F

CH HH

CH IH

CH IR

CH IX

CH IY

CH L
CH M

CH  N 

CH  OR 

CH  OW 

CH OY1

CH  R 

CH  S 

CH  SH 

CH  SIB 

CH  SIC 

CH  SID 

CH  SIG 

CH  SIJ 

CH SIK

CH  SIP 

CH  SIT 

CH  TH 

CH  UH 

CH  UR 

CH  UW 

CH  V 

CH  W 

CH  Y 

CH  YU 

CH  Z 

CH  ZH 

D  - 

D  AA 

D  AE 

D  AH 

D  AO 

D  AR 

D  AW 

D  AX 

D  AXR 

D  AY1 

D  DH 

D  EH 

D  ER 

D  EY 

D  EYR 

D  F 

D  HH 

D  IH 

D  IR 

D  IX 

D  IY 

D  L 

D  M 

D  N 

D  OR 

D  OW 

D OY1

D  R 

D  S 

D  SH 

D  SIB 

D  SIC 

D  SID 

D  SIG 

D  SIJ 

D SIK

D  SIP 

D  SIT 

D  TH 

D  UH 

D  UR 

D  UW 

D  V 

D  W 

D  Y 

D  YU 

D  Z 

D  ZH 

DH  - 

DH  AA 

DH  AE 

DH  AH 

DH  AO 

DH  AR 

DH  AW 

DH  AX 

DH  AXR 

DH  AY1 

DH  DH 

DH  EH 

DH  ER 

DH  EY 

DH  EYR 

DH  F 

DH  HH 

DH  IH 

DH  IR 

DH  IX 

DH  IY 

DH  L 

DH  M 

DH  N 

DH  OR 

DH  OW 

DH OY1

DH  R 

DH  S 

DH  SH 

DH  SIB 

DH  SIC 

DH  SID 

DH  SIG 

DH  SIJ 

DH SIK

DH  SIP 

DH  SIT 

DH  TH 

DH  UH 

DH  UR 